For this lab, we will work with gene expression data measured on breast and ovary tumors. The data originally comes from http://gemler.fzv.uni-mb.si/index.php but has been downsized so that it is easier to work with in our labs.
The data is similar to the Endometrium vs. Uterus cancer data set we have been working with for several weeks.
The data we will work with contains the expression of 3,000 genes, measured for 344 breast tumors and 198 ovary tumors.
In [ ]:
import numpy as np # numeric python
# scikit-learn (machine learning)
from sklearn import preprocessing
from sklearn import decomposition
In [ ]:
# Graphics
%matplotlib inline
import matplotlib.pyplot as plt
In [ ]:
Question: What are the dimensions of X? How many samples come from ovary tumors? How many come from breast tumors?
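A possible way to answer this, assuming the expression matrix X and a label vector y (coded -1 for breast and +1 for ovary, as in the plotting loop further down) have been loaded; a random stand-in matrix is used here for illustration only:

```python
import numpy as np

# Stand-in data for illustration only: in the lab, X and y come from the
# downloaded gene expression files (344 breast + 198 ovary samples, 3,000 genes).
rng = np.random.RandomState(0)
X = rng.normal(size=(542, 3000))
y = np.array([-1] * 344 + [1] * 198)  # -1 = breast, +1 = ovary

print("Dimensions of X:", X.shape)         # (542, 3000)
print("Breast samples:", np.sum(y == -1))  # 344
print("Ovary samples:", np.sum(y == 1))    # 198
```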
PCA documentation: http://scikit-learn.org/0.17/modules/decomposition.html#pca and http://scikit-learn.org/0.17/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
In [ ]:
In [ ]:
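The next cell fits a PCA on X_norm, which is assumed to be a standardized version of X. One way to obtain it with the preprocessing module imported above (a sketch on a stand-in matrix):

```python
import numpy as np
from sklearn import preprocessing

# Stand-in for the raw expression matrix (in the lab, use the loaded X).
rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(542, 3000))

# Center each gene to mean 0 and scale it to unit variance.
scaler = preprocessing.StandardScaler()
X_norm = scaler.fit_transform(X)
```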
pca = decomposition.PCA(n_components=30)
pca.fit(X_norm)
Question: Plot the fraction of variance explained by each component. Use pca.explained_variance_ratio_
In [ ]:
# TODO
plt.xlim([0, 29])
plt.xlabel("Number of PCs", fontsize=16)
plt.ylabel("Fraction of variance explained", fontsize=16)
Question: Use pca.transform to project the data onto its principal components. How is pca.explained_variance_ratio_ computed? Check this is the case by computing it yourself.
In [ ]:
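pca.explained_variance_ratio_ is the variance of the projected data along each component, divided by the total variance of the (centered) data. A sketch of this check on stand-in data:

```python
import numpy as np
from sklearn import decomposition

# Stand-in for the standardized expression data.
rng = np.random.RandomState(0)
X_norm = rng.normal(size=(200, 50))

pca = decomposition.PCA(n_components=30)
X_proj = pca.fit_transform(X_norm)  # project onto the principal components

# Variance along each PC divided by the total variance; the ddof convention
# cancels out in the ratio, so this matches sklearn's attribute.
ratio = X_proj.var(axis=0) / X_norm.var(axis=0).sum()
print(np.allclose(ratio, pca.explained_variance_ratio_))  # True
```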
Question: Plot the data in the space of the two first components; color breast samples in blue and ovary samples in orange. What do you observe? Can you separate the two classes visually?
In [ ]:
for color_name, tissue, tissue_name in zip(['blue', 'orange'], [-1, 1], ['breast', 'ovary']):
    plt.scatter(# TODO,
                c=color_name, label=tissue_name)
plt.legend(loc=(1.1, 0), fontsize=14)
plt.xlabel("PC 1", fontsize=16)
plt.ylabel("PC 2", fontsize=16)
Bonus question: Rather than visually, actually try to separate the two classes by a logistic regression line (using only the two first PCs). Plot the decision boundary. You can draw inspiration from http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html#sphx-glr-auto-examples-linear-model-plot-iris-logistic-py for the plot.
In [ ]:
In [ ]:
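A sketch of the bonus question on synthetic 2D data (in the lab, X2 would be the first two columns of pca.transform(X_norm)); it follows the grid-evaluation idea of the linked scikit-learn example:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Stand-in 2D data: two roughly separated classes.
rng = np.random.RandomState(0)
X2 = np.vstack([rng.normal(-1, 1, size=(344, 2)),
                rng.normal(1, 1, size=(198, 2))])
y = np.array([-1] * 344 + [1] * 198)

clf = linear_model.LogisticRegression()
clf.fit(X2, y)

# Evaluate the classifier on a grid covering the data, then shade the regions;
# the boundary between the two shaded areas is the decision line.
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 200),
                     np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X2[:, 0], X2[:, 1], c=y, edgecolors='k')
plt.xlabel("PC 1", fontsize=16)
plt.ylabel("PC 2", fontsize=16)
```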
Question: Repeat the PCA procedure on the data without outliers. Can you now visually separate the two tissues?
In [ ]:
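One simple heuristic for dropping outliers before refitting (an assumption for illustration, not necessarily the intended approach; in the lab you would typically spot the outliers on the PC1/PC2 scatter plot):

```python
import numpy as np
from sklearn import decomposition

# Stand-in data with three planted, obvious outliers.
rng = np.random.RandomState(0)
X_norm = rng.normal(size=(200, 50))
X_norm[:3] += 25.0

pca = decomposition.PCA(n_components=2)
X_proj = pca.fit_transform(X_norm)

# Flag samples lying very far from the origin in the PC1/PC2 plane.
dist = np.sqrt((X_proj ** 2).sum(axis=1))
keep = dist < dist.mean() + 3 * dist.std()
print("Outliers removed:", np.sum(~keep))

# Refit the PCA on the remaining samples only.
pca2 = decomposition.PCA(n_components=2)
X_proj2 = pca2.fit_transform(X_norm[keep])
```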
Question: How many PCs do you think are sufficient to represent your data? What do you expect will happen if you use the projection of the gene expressions on these PCs and run a cross-validation of a classification algorithm? Try it out. Is there a risk of overfitting when you do this?
In [ ]:
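A sketch of the cross-validation, using the modern scikit-learn API (sklearn.model_selection, which replaced sklearn.cross_validation after the 0.17 release the links above point to). Note that fitting the PCA inside the cross-validation loop, via a Pipeline, avoids leaking information from the test folds into the projection; fitting the PCA once on all the data before cross-validating is exactly where the overfitting risk lies:

```python
import numpy as np
from sklearn import decomposition, linear_model
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Stand-in data and labels.
rng = np.random.RandomState(0)
X_norm = rng.normal(size=(200, 50))
y = rng.choice([-1, 1], size=200)

# PCA is refit on each training fold, then the classifier runs on the PCs.
model = Pipeline([("pca", decomposition.PCA(n_components=10)),
                  ("clf", linear_model.LogisticRegression())])
scores = cross_val_score(model, X_norm, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```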
Question: Working on the original features, how do you expect your decision boundary (and AUC) to change, for different algorithms, depending on whether or not the outliers are included in the data? Try it out.
In [ ]:
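A sketch of one way to try this out: cross-validated AUC on the original features, with and without planted outliers (synthetic data for illustration; the outlier-planting step is an assumption standing in for the real outlying samples):

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import cross_val_score

# Stand-in data: two weakly separated classes in 50 dimensions.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-0.2, 1, size=(100, 50)),
               rng.normal(0.2, 1, size=(100, 50))])
y = np.array([-1] * 100 + [1] * 100)

# Copy with a few extreme samples planted in the negative class.
X_out = X.copy()
X_out[:3] += 25.0

clf = linear_model.LogisticRegression()
auc_with = cross_val_score(clf, X_out, y, cv=5, scoring="roc_auc").mean()
auc_without = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print("AUC with outliers:    %.3f" % auc_with)
print("AUC without outliers: %.3f" % auc_without)
```

How much the outliers move the decision boundary depends on the algorithm: heavily regularized or margin-based methods tend to be more robust than an unregularized linear fit.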